With the development of conference systems, people are no longer satisfied with basic communication of conference contents, and instead pursue comfort and an on-the-scene feeling in communications. A telepresence conference system, which is an immersive virtual conference technology developed to meet this demand, is able to provide a good on-the-scene feeling for a conferee, including life-size images, eye contact, location identification by voice, and so on, thus creating a face-to-face feeling between two communicating parties.

According to the prior art, a life-size image can hardly be provided by one camera plus one conference terminal display in each conference place. Therefore, in a current telepresence system, a plurality of cameras are generally arranged in each conference place, and simultaneous display of a plurality of video streams is supported. FIG. 1 is a schematic diagram of an arrangement of a telepresence conference place with three displays, including three displays 101 arranged side by side, telepresence desks 102 and telepresence chairs 103 provided opposite to the displays 101, and so on. Life-size display and an on-the-scene perspective can be achieved through this arrangement, thus providing more realistic voices and images with a stronger on-the-scene feeling.

In a conference place of a common conference system, a terminal is paired with a single display, so an existing scrolling caption is simply scrolled on that one display. When a telepresence conference place is provided with a plurality of displays, however, the existing solution can only present a scrolling caption or an image on one particular screen, which is not user-friendly. A better user experience will be achieved if a caption or an image can scroll across the plurality of displays with a scrolling effect in which the plurality of displays are integrated as a whole.
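The desired effect, in which side-by-side displays behave as one continuous scrolling surface, can be illustrated with a minimal sketch. It is not the method of any particular system. It assumes equally wide displays arranged left to right with no gap between them, and all names (`caption_segments`, `caption_x`, `display_width`, and so on) are hypothetical. The caption position is expressed in the coordinates of a single virtual canvas spanning all displays, and the function computes which slice of the caption each display would have to draw.

```python
def caption_segments(caption_x, caption_width, display_width, num_displays):
    """Split a caption placed on a virtual canvas into per-display slices.

    caption_x is the caption's left edge on the virtual canvas (may be
    negative or beyond the last display while scrolling in or out).
    Returns a list of (display_index, local_x, src_start, src_end):
      local_x   - where the slice starts on that display
      src_start - first caption column shown on that display
      src_end   - one past the last caption column shown
    All names and the equal-width, gap-free layout are assumptions.
    """
    segments = []
    caption_end = caption_x + caption_width
    for i in range(num_displays):
        left = i * display_width          # display's left edge on the canvas
        right = left + display_width      # display's right edge on the canvas
        vis_start = max(caption_x, left)  # overlap of caption and display
        vis_end = min(caption_end, right)
        if vis_start < vis_end:
            segments.append(
                (i, vis_start - left, vis_start - caption_x, vis_end - caption_x)
            )
    return segments


# A 1000-pixel caption straddling the first two of three 1920-pixel displays.
print(caption_segments(1500, 1000, 1920, 3))
# → [(0, 1500, 0, 420), (1, 0, 420, 1000)]
```

Advancing `caption_x` by a fixed step on every frame and redrawing each returned slice would make the caption appear to leave one display and enter the next without interruption. A real arrangement would also have to account for bezel widths and for synchronizing the terminals that drive each display, which this sketch ignores.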